NAR Breakthrough Article

A systematic survey of the Cys2His2 zinc finger DNA-binding landscape

Authors

  • Anton V. Persikov
  • Joshua L. Wetzel
  • Elizabeth F. Rowland
  • Benjamin L. Oakes
  • Denise J. Xu
  • Mona Singh
  • Marcus B. Noyes
Abstract

Cys2His2 zinc fingers (C2H2-ZFs) comprise the largest class of metazoan DNA-binding domains. Despite this domain’s well-defined DNA-recognition interface, and its successful use in the design of chimeric proteins capable of targeting genomic regions of interest, much remains unknown about its DNA-binding landscape. To help bridge this gap in fundamental knowledge and to provide a resource for design-oriented applications, we screened large synthetic protein libraries to select binding C2H2-ZF domains for each possible three base pair target. The resulting data consist of >160 000 unique domain–DNA interactions and comprise the most comprehensive investigation of C2H2-ZF DNA-binding interactions to date. An integrated analysis of these independent screens yielded DNA-binding profiles for tens of thousands of domains and led to the successful design and prediction of C2H2-ZF DNA-binding specificities. Computational analyses uncovered important aspects of C2H2-ZF domain–DNA interactions, including the roles of within-finger context and domain position on base recognition. We observed the existence of numerous distinct binding strategies for each possible three base pair target and an apparent balance between affinity and specificity of binding. In sum, our comprehensive data help elucidate the complex binding landscape of C2H2-ZF domains and provide a foundation for efforts to determine, predict and engineer their DNA-binding specificities.

INTRODUCTION

The Cys2His2 zinc finger (C2H2-ZF) is the most frequently occurring DNA-binding domain in metazoan proteins, and is found in nearly half of human transcription factors (1,2). C2H2-ZF proteins have been implicated in a wide range of biological processes, including development (3), recombination (4) and chromatin regulation (5). Thus, a thorough understanding of how C2H2-ZF proteins specify their DNA-binding sites would be invaluable in mapping regulatory networks across a broad spectrum of eukaryotes.
An individual C2H2-ZF domain contains a well-conserved DNA-binding structural interface and specifically recognizes its DNA target via amino acids occupying four key ‘canonical’ positions of an alpha-helix (6–9). C2H2-ZF proteins that bind DNA typically do so via tandem arrays of multiple, closely linked C2H2-ZF domains. An individual finger binds a contiguous three-nucleotide subsequence, 3′ to 5′, along with a potential fourth, cross-strand contact that overlaps the target of the N-terminal adjacent finger (Figure 1A). Unlike other structural classes of DNA-binding domains that typically offer a limited range of specificities, C2H2-ZF domains can specify a wide range of three base pair (3bp) targets (10–15). Due to the combination of largely modular binding and the wide range of DNA-binding specificities achievable via individual domains, C2H2-ZF arrays can, in theory, specify virtually any DNA site of interest. As such, C2H2-ZF domains serve as an attractive, general-purpose scaffold for engineering DNA-binding specificity. Indeed, efforts from the protein design community have resulted in chimeric proteins that use C2H2-ZF domains to target specific genomic locations at which to carry out particular functions. Such technology has enabled modification of transcriptional outputs (16,17) and chromatin (18,19), as well as precise genome editing when fused with nuclease or recombinase domains (20–26).

Figure 1. Schematic of bacterial one-hybrid protein selections. (A) Schematic of F2 (top) and F3 (bottom) protein selections. Individual C2H2-ZF domains are selected in the context of a protein containing an array of three domains. The fixed C2H2-ZF domains are shown as solid colors while the variable C2H2-ZF domain is shown as a rainbow. Primary contacts with the bases are shown with arrows. The individual selections place a unique 3bp target in the appropriate position, noted as yellow bases (b1, b2 and b3), to assay the interaction of the variable C2H2-ZF domain. Underneath, the bases of the primary strand shown 5′ to 3′ are noted. Above each C2H2-ZF domain, the sequence of the recognition helix is shown N to C, with each variable position shown as a red ‘X’. (B) Schematic of the C2H2-ZF selection. (Top) Proteins are expressed as a 3-fingered protein-direct fusion to the omega subunit of RNA polymerase. C2H2-ZF domains are selected to bind the target sequence placed 10bp upstream of the promoter that drives the reporter genes, HIS3 and URA3, as described in Supplemental Methods 1b. In the example shown, C2H2-ZF domains would be selected from the F3 library to bind the 5′-ACC-3′ (shown in green). (Bottom) Two plasmids, the protein expression vector (here shown from the F3 library) and the target reporter vector, are transformed into the bacterial strain. Double transformants are plated on selective media. DNA is recovered from the cells and the region of the library vector that codes for the variable region is sequenced. Enriched amino acid sequences are shown as a sequence logo.

*To whom correspondence should be addressed. Tel: +1 609 258 6385; Fax: +1 609 258 8020; Email: [email protected]
Correspondence may also be addressed to Mona Singh. Tel: +1 609 258 7059; Fax: +1 609 258 1771; Email: [email protected]
†These authors contributed equally to the paper as first authors.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Nucleic Acids Research, 2015, Vol. 43, No. 3
Despite the importance of C2H2-ZF proteins for natural systems and their successful use in protein design, our knowledge about their DNA-binding landscapes remains surprisingly incomplete. Indeed, the binding specificities of most C2H2-ZFs within genomes are not known. For example, in human, specificities are known for less than a hundred of approximately 700 C2H2-ZF proteins (15). In fruit fly, specificities have been successfully determined for only ∼20% of C2H2-ZF proteins, with 62% of the tested C2H2-ZFs failing characterization in a recent screen (10). Additionally, limited knowledge about context-dependent effects (either between C2H2-ZF domains adjacent to one another within an array (27–29), among contacts within a single finger-DNA interface, or simply due to the position of a finger within an array (30)) has made the process of selecting, engineering and assembling synthetic C2H2-ZF proteins with desired DNA-binding specificities quite challenging (11,31). Further, while the well-defined interaction interface of C2H2-ZF domains has enabled the development of computational methods for predicting their DNA-binding specificities (32–40), the performances of these methods leave much room for improvement. Thus, a better understanding of the determinants of C2H2-ZF DNA-binding specificity would both enable highly reliable predictions of natural transcription factor binding sites and facilitate the design of engineered proteins with de novo binding specificities.

In order to further our understanding of C2H2-ZF DNA binding as well as to provide a resource for prediction, selection and/or design-oriented applications, here we report the results of screening all 64 possible 3bp targets for interactions with C2H2-ZF domains from multiple large protein libraries (41). This set of screens represents the most comprehensive and systematic survey of the C2H2-ZF DNA-binding landscape to date.
We uncover pools of hundreds to thousands of C2H2-ZF domains capable of binding each 3bp target. Via cross-examination of these independent protein selections, we are able to simultaneously characterize binding profiles for thousands of C2H2-ZF domains and thus infer their quantitative DNA-binding specificities. For a diverse subset of the selected fingers, we confirm that these predicted specificities are highly concordant with experimentally determined specificities. We also demonstrate that the binding behavior gleaned from our large synthetic pools generalizes well to natural systems by adapting a simple nearest neighbor approach to accurately predict DNA-binding specificities for naturally occurring C2H2-ZF proteins. Furthermore, the diversity of our vast pools enables us to choose test fingers highly specific for nearly every 3bp target as well as to select three-finger combinations able to specify several challenging 9bp DNA sequences that proteins constructed via modular assembly failed to bind in previous efforts (11,31).

Additional analyses presented here elucidate the complex nature of C2H2-ZF DNA-binding landscapes. For example, we observe finger-DNA interfaces that alternately either confirm or defy previously proposed position-specific amino acid-base recognition rules for C2H2-ZFs. We explore the important role of ‘within-finger’ context, demonstrating that the same amino acid in a given contacting position of the recognition helix may specify up to all four different bases depending upon the context provided by the amino acids occupying other key positions in that C2H2-ZF domain. We also find that within-array domain position plays an important role in influencing base recognition, even when the neighboring finger context is the same. Lastly, we observe an apparent balance between affinity and specificity in interactions between C2H2-ZFs and DNA.
Altogether, by developing an approach that integrates data from independent protein selections across all possible targets, we provide a foundational blueprint for further large-scale investigations of the DNA-binding specificity of this important domain, as well as a valuable resource for predicting and designing DNA-binding specificity for C2H2-ZFs.

MATERIALS AND METHODS

Overview of experimental approach

To systematically survey the DNA-binding landscape of the C2H2-ZF domain, we used site-directed mutagenesis to assemble diverse C2H2-ZF protein libraries with six variable amino acid positions (41), as guided by prior engineering efforts (13,42–45) and the Zif268 structure (8). These libraries allowed each of the 20 possible amino acids in the -1, 1, 2, 3, 5 and 6 positions in regard to the alpha-helix of either the middle (F2) or C-terminal (F3) positions of a model Zif268-based system (Figure 1A and Supplemental Methods 1a). The quality, diversity and uniformity of sampling within these libraries have been validated (41). This experimental system vastly expands the repertoire of C2H2-ZF domains available for selection, as most previously reported libraries either considered fewer randomized residue positions (27,28) or used a coding scheme that did not permit all amino acids (29–30,46).

A comprehensive set of protein selections was performed, wherein each of the 64 possible 3bp DNA targets was screened against our expansive C2H2-ZF libraries using an omega-based bacterial one-hybrid (B1H) system (12,21,47). Specifically, a variable finger was expressed in either the middle or C-terminal (F2 or F3) position of a three-fingered protein where the adjacent, non-varying fingers have known specificities (Supplemental Methods 1b). These three-fingered proteins were expressed as fusions to the omega subunit of RNA polymerase; omega acts as an activation domain in this hybrid assay.
For each selection, the 3bp site of interest was placed in a position relative to the targets of these fixed ‘anchor’ fingers such that upon binding, the anchor fingers will situate the test finger in a position in proximity to the desired target (Figure 1A). Only a positive interaction between the test finger and the site of interest will lead to an omega-guided recruitment of RNA polymerase and activate the transcription of a necessary HIS3 reporter gene (Figure 1B). Therefore, when these cells are grown on minimal media that requires the activation of HIS3 transcription, only a functional protein–DNA interaction will lead to survival of the bacteria (Figure 1B). To recover these positive protein–DNA interactions, cells from each selection were pooled, DNA harvested and their C2H2-ZF constructs sequenced.

The affinity of a protein–DNA interaction has been demonstrated to relate to growth rate in the B1H system, and the level of affinity required to activate HIS3 can be modulated by changing the concentration of 3-amino triazole (3-AT, a competitive inhibitor of HIS3) in the selection media (12,48–49). All of our protein selections were performed at low (2-mM 3-AT) and high (10-mM 3-AT) levels of the inhibitor, representing low and high stringency selections, respectively. The number of sequences recovered from a given selection that correspond to a particular protein–DNA interaction (which is proportional to the size of a colony) and the recovery of that interaction at a given inhibitor concentration are both related to the affinity of the interaction. We note that the B1H system does not directly measure the affinity of a particular protein–DNA interaction. Indeed, for any particular colony, other factors besides affinity and/or inhibitor concentration may influence that colony’s growth rate.
However, throughout this work we assume that, across a population of interactions recovered from a given selection, affinity of the protein–DNA interaction is the primary determinant of growth rate.

Details of the bacterial one-hybrid selection procedures have been described previously (12,47). Modification to these procedures and details of the libraries used in this manuscript can be found in Supplemental Methods 1a–f. To characterize the DNA-binding specificity of a particular (test) C2H2-ZF protein, the procedure is reversed, whereby binding of the test C2H2-ZF to various sequences in a randomized DNA library (Supplementary Figure S1A) leads to activation of the HIS3 reporter (Supplementary Figure S1B).

Building and selecting three-fingered C2H2-ZF libraries

The B1H system was also used to select three-fingered arrays of C2H2-ZFs that specify particular 9bp DNA targets. These selections were performed, in principle, as previously demonstrated (13,21,42,50). Briefly, we used the pools of fingers recovered from our individual zinc finger selections as polymerase chain reaction (PCR) templates to build three-fingered C2H2-ZF libraries directed at binding particular 9bp targets. These 9bp targets were chosen based on the observation that C2H2-ZF proteins built by modular assembly had failed to bind them in two separate publications (11,31). For each 9bp target, a three-fingered ‘pool library’ was assembled. To create pool libraries, individual zinc finger pools corresponding to each 3bp subsite of the 9bp target were used as templates for PCR. For example, if targeting the sequence 5′-AAA-CCC-GGG-3′, the AAA, CCC and GGG pools would be used as the PCR template for each finger of the library. PCR primers were designed so that the resulting PCR pools could then be assembled by overlapping PCR into a three-fingered coding sequence of the order N-terminus-poolGGG-poolCCC-poolAAA-C-terminus (zinc fingers bind DNA anti-parallel to the 5′-3′ sequence of DNA).
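The pool ordering in the example above can be sketched in a few lines of Python. This is only an illustration of the anti-parallel ordering rule stated in the text; the function name `finger_pool_order` is hypothetical and is not part of the authors' pipeline.

```python
def finger_pool_order(target_9bp: str) -> list[str]:
    """Return the 3bp subsite pools in N- to C-terminal finger order
    for a 9bp target written 5' to 3' (hyphens are optional)."""
    target = target_9bp.replace("-", "").upper()
    if len(target) != 9 or set(target) - set("ACGT"):
        raise ValueError("expected a 9bp DNA target over A, C, G, T")
    # Subsites as they appear on the primary strand, 5' -> 3'.
    subsites = [target[i:i + 3] for i in range(0, 9, 3)]
    # Fingers bind anti-parallel to the 5'-3' sequence, so the N-terminal
    # finger is drawn from the pool for the 3'-most subsite.
    return subsites[::-1]


order = finger_pool_order("AAA-CCC-GGG")
print(order)  # ['GGG', 'CCC', 'AAA']
```

The reversal is the whole point of the sketch: the pool for the 3′-most subsite supplies the N-terminal finger.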
This process ensures that each finger in the pools used as templates for each position of the array has already shown the ability to bind the desired 3bp subsite in the previously performed individual-finger selections. Therefore, if we estimate that each pool for a given 3bp subsite contains between 100 and 1000 C2H2-ZFs, each assembled library offers a theoretical complexity of $10^6$ to $10^9$ three-fingered combinations from which to find compatible sets of zinc fingers that are uniquely suited for the context offered by the desired 9bp target. The final three-fingered PCR products were digested and cloned into the B1H omega-based expression vector. The 9bp target of interest was placed 10bp upstream of the promoter that drives HIS3 expression in the B1H system and C2H2-ZF proteins were selected (as described above) from the new corresponding three-fingered library. Cells were harvested and their C2H2-ZF constructs were sequenced to find enriched protein sequences. For each 9bp target, sequenced candidates that closely resembled the enriched consensus of protein sequences were chosen and their specificities tested by B1H binding site selections as described above.

Affinity-related green fluorescent protein activation in yeast

Selected C2H2-ZFs of interest were screened for their ability to activate a green fluorescent protein (GFP) reporter in yeast as previously described (17,41). Each C2H2-ZF was cloned into the yeast genome to be expressed from an ACT1 promoter as a direct C2H2-ZF-estrogen receptor-VP16 fusion (ZEV). Binding sites to be tested were cloned into a minimal GAL1 promoter upstream of a GFP cassette on a yeast centromere (CEN) plasmid containing a URA3 cassette. These plasmids were then transformed into the appropriately constructed yeast strains in order to pair the desired ZEV-binding site combination to test.
In each experiment, positive and negative controls (the original high affinity Z3EV system paired with either its optimal target or an empty vector, respectively) were also performed to control for experimental error. Next, for each sample tested, transport of the ZEV construct was induced with the addition of 100-nM β-estradiol and cultures were grown for 12 h. The mean fluorescence of each sample was measured with a BD LSRII Multi-Laser Analyzer with High Throughput Sampler (BD Biosciences, Sparks, MD, USA). Mean fluorescence values were determined from at least 50 000 cells. Each C2H2-ZF-binding site pair was assayed in triplicate and means were normalized to the positive control. Previous work has shown that normalized GFP expression can be related to known levels of relative affinity (17,41). For this, a key is provided in our figure of normalized GFP expression when Z3EV is paired with a suite of binding sites that have known affinities relative to the Z3EV optimal target.

Processing, filtering and quality analysis of C2H2-ZF protein selection data

Following each protein selection, C2H2-ZF constructs harvested from cells were Illumina sequenced. The base-2 logarithms of observed sequence counts were used to compute frequency distributions (per selection and considering only varied positions). Sequences with very low (<0.0001) frequencies were removed from each distribution and the resulting data were processed and filtered for quality according to an entropy-based procedure as described previously (41). Additional details regarding the processing and filtering of protein selection data are provided in Supplemental Methods 2a. Various measures to ensure data quality and consistency were taken, as described in the Results section and further detailed in Supplemental Methods 2b.
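As a rough illustration of the frequency step, the sketch below converts raw counts into a log2-based frequency distribution and drops entries below the 0.0001 cutoff. The text does not spell out the exact formula, so dividing each log2 count by the sum of log2 counts is an assumption made here. The function name and the toy helix sequences are also hypothetical, and the entropy-based quality filter from (41) is not reproduced.

```python
import math


def filter_low_frequency(counts: dict[str, int], min_freq: float = 1e-4) -> dict[str, float]:
    """Log2-based frequencies for one selection, with rare sequences removed.

    Assumption: frequency = log2(count) / sum of log2(count) over the selection.
    """
    log_counts = {seq: math.log2(c) for seq, c in counts.items() if c > 0}
    total = sum(log_counts.values())
    if total == 0:
        return {}
    freqs = {seq: value / total for seq, value in log_counts.items()}
    # Keep only sequences at or above the frequency cutoff.
    return {seq: f for seq, f in freqs.items() if f >= min_freq}


# Toy counts for three six-residue helices (made-up sequences).
kept = filter_low_frequency({"RSDHTR": 1024, "QSSNLV": 256, "AAAAAA": 1})
print(sorted(kept))  # ['QSSNLV', 'RSDHTR']
```

Under this reading a sequence seen once has a log2 count of zero and is always removed, which may or may not match the original processing.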
Entropy and mutual information analysis of protein selections

For a given 3bp DNA target, we considered all protein sequences selected to bind it in the data set and computed, for each amino acid in each variable position in the alpha-helix (-1, 1, 2, 3, 5, 6), the fraction of sequences in which it was observed in that position. These were then used to derive the Shannon entropy per position as $H = -\sum_i p_i \log p_i$, where $p_i$ is the fraction of distinct sequences with amino acid $i$ in the position under consideration. For both the entropy and mutual information analyses, the frequency with which each protein sequence is observed within a 3bp target was ignored.

In order to examine the level of dependence between particular residue positions of the alpha-helix and particular base positions of the bound 3bp DNA regions, we performed a mutual information analysis. For each variable amino acid position $i$, we computed its distribution of amino acids $A_i$ by calculating the fraction of times a specific amino acid was observed in this position across the data set. Similarly, for each base position $j$, we computed the distribution of bases $B_j$. We then computed the mutual information, $MI(A_i, B_j) = H(A_i) - H(A_i \mid B_j)$, where $H(X)$ is the Shannon entropy of the distribution of random variable $X$, as described above. The mutual information was then normalized to a value between 0 and 1 as $S = MI(A_i, B_j)/\min(H(A_i), H(B_j))$. Using this normalization, if $A_i$ and $B_j$ are independent, $S$ is zero, whereas if $A_i$ is a deterministic function of $B_j$, $S$ is 1. In order to assess the significance of the level of normalized mutual information observed, we performed 1000 randomization experiments. Specifically, for the set of observed finger-DNA interfaces, we decoupled the interfaces by randomly permuting the DNA targets with respect to the helices that bound them.
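The entropy and normalized mutual information defined above can be sketched as follows for one amino acid position paired with one base position. This is a minimal illustration, not the authors' code, and the function names are hypothetical. Shuffling a single base column stands in for permuting whole targets, which gives the same randomization for any one position pair. The empirical P-value is taken as the fraction of permutations whose normalized mutual information exceeds the observed value.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mi(aas, bases):
    """S = MI(A, B) / min(H(A), H(B)) for paired observations."""
    n = len(aas)
    h_a, h_b = entropy(aas), entropy(bases)
    # H(A|B) = sum over bases b of p(b) * H(A | B = b)
    h_a_given_b = 0.0
    for b, count in Counter(bases).items():
        subset = [a for a, bb in zip(aas, bases) if bb == b]
        h_a_given_b += (count / n) * entropy(subset)
    denom = min(h_a, h_b)
    return (h_a - h_a_given_b) / denom if denom > 0 else 0.0


def empirical_p_value(aas, bases, n_perm=1000, seed=0):
    """Fraction of permutations with higher normalized MI than observed."""
    rng = random.Random(seed)
    observed = normalized_mi(aas, bases)
    shuffled = list(bases)
    higher = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if normalized_mi(aas, shuffled) > observed:
            higher += 1
    return higher / n_perm


# Toy interfaces: the residue at one helix position and the base it faces.
s_dependent = normalized_mi(["R", "R", "D", "D"], ["G", "G", "C", "C"])
s_independent = normalized_mi(["R", "D", "R", "D"], ["G", "G", "C", "C"])
```

Because $S$ is a ratio of entropies, the base of the logarithm does not affect it; natural logarithms are used here for convenience.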
We then repeated our mutual information analysis with respect to this set of random interfaces and computed an empirical P-value based upon the fraction of times the normalized mutual information was higher for a pair in the randomized data than was observed in the actual data.

Core sequence representation of C2H2-ZF proteins

For each C2H2-ZF domain, we also consider its ‘core sequence’ representation, defined by the amino acids present in the four canonical positions of the recognition helix (i.e. -1, 2, 3 and 6). Because positions 1 and 5 can vary, each so-called ‘core sequence’ can correspond to multiple C2H2-ZF domains observed in our data set. Thus, when we refer to a core sequence, we are referring to all of the sequences with those amino acids occupying the -1, 2, 3 and 6 positions of the C2H2-ZF domain. The frequency of a core sequence within a specific data set (e.g., in a specific selection for uncovering domains binding a particular 3bp target or across all target selections in either F2 or F3) is defined to be the sum of the frequencies of all full-length protein sequences that share that core sequence representation.

Computing binding profiles for core sequences and computationally inferring DNA-binding specificities via lookup

For either the F2 or F3 protein selections, for each core sequence we considered the frequency with which it was found in each of the 64 possible 3bp targets. For each core sequence, we then normalized these frequencies so that they summed to 1 across the 64 3bp targets, and thereby obtained a binding profile that represents a probability distribution specifying the preference of a core sequence for each 3bp target. We denote this binding profile for a specific core sequence as $bp$. To predict the DNA-binding specificity of a core sequence, we computed the probability of each nucleotide $n$ in position b1 as $p_{n,1} = \sum_{i,j} bp_{n,i,j}$, where $bp_{n,i,j}$ is the profile value for the 3bp target with nucleotide $n$ at b1, $i$ at b2 and $j$ at b3. The predicted probabilities with which the nucleotides occur in positions b2 and b3 were computed analogously.
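A minimal sketch of the core sequence and binding profile computations follows, under some assumptions. The six variable positions are stored as a six-letter string in the order -1, 1, 2, 3, 5, 6. Selections are given as a mapping from each 3bp target to helix frequencies, for one finger position (F2 or F3) at a time. All function names and the toy data are hypothetical.

```python
from collections import defaultdict

HELIX_POSITIONS = (-1, 1, 2, 3, 5, 6)   # order of residues in a helix string
CORE_POSITIONS = (-1, 2, 3, 6)          # canonical contacting positions


def core_sequence(helix6: str) -> str:
    """Keep only the residues at positions -1, 2, 3 and 6."""
    return "".join(helix6[HELIX_POSITIONS.index(p)] for p in CORE_POSITIONS)


def binding_profiles(selections):
    """{target: {helix: freq}} -> {core: {target: probability}}."""
    raw = defaultdict(lambda: defaultdict(float))
    for target, helices in selections.items():
        for helix, freq in helices.items():
            # Core frequency = sum over all helices sharing that core.
            raw[core_sequence(helix)][target] += freq
    profiles = {}
    for core, by_target in raw.items():
        total = sum(by_target.values())
        profiles[core] = {t: f / total for t, f in by_target.items()}
    return profiles


def position_probabilities(profile):
    """Marginalize a 3bp-target profile into per-position base probabilities."""
    columns = [dict.fromkeys("ACGT", 0.0) for _ in range(3)]
    for target, prob in profile.items():
        for pos, base in enumerate(target):
            columns[pos][base] += prob
    return columns


def preferred_target(columns):
    """Most probable base at each of b1, b2, b3."""
    return "".join(max("ACGT", key=col.get) for col in columns)


# Toy data: two helices share the core 'RDHR'.
selections = {"GGG": {"RSDHTR": 0.3, "RADHSR": 0.3}, "GAG": {"RSDHTR": 0.2}}
profiles = binding_profiles(selections)
columns = position_probabilities(profiles["RDHR"])
best = preferred_target(columns)
```

In this toy case the core RDHR has profile values of 0.75 for GGG and 0.25 for GAG, so only the b2 column is split between G and A.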
We refer to this method for predicting the DNA-binding specificity for a core sequence as the ‘lookup’ procedure, as it is based upon finding the core sequence in question across all of the protein selections. Finally, for several analyses described below, each core sequence was assigned to its most preferred 3bp target by choosing the nucleotide in each position that has the highest inferred probability according to this lookup procedure.

Processing binding site selection data

Illumina sequencing and analysis were used to uncover binding site preferences selected by candidate C2H2-ZF proteins via B1H selection. The data were filtered for quality, searched for enriched motifs and visualized via sequence logos as described in our previous work (41). We provide further details regarding the processing and filtering of binding site selection data, motif finding, clustering and visualization in Supplemental Methods 2c.

Clustering C2H2-ZF domains within preferred targets

For the F2 and F3 protein selection data separately, we assigned each observed core sequence to the target for which it had the highest preference, as computed via the lookup procedure described above. For each 3bp target, we saw a diverse group of core sequences assigned to it in either the F2 or F3 selections. We obtained the full six amino acid sequences, including positions -1, 1, 2, 3, 5 and 6, of the corresponding zinc fingers and clustered them into ‘specificity groups’ of similar sequences that offer alternative binding strategies for that particular 3bp target. In particular, each 3bp target was described as a graph with all observed six amino acid sequences represented as nodes in the graph. The similarity between two sequences was computed using the BLOSUM62 matrix (51) and normalized to be between 0 and 1. Two nodes were connected with an edge if the similarity score between the two corresponding protein sequences exceeded 0.25.
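The graph construction can be sketched as below. The BLOSUM62-based score and its normalization are not reproduced here: the default `identity_similarity` is a deliberately simple placeholder (fraction of identical positions), and any function returning a value between 0 and 1 can be passed in its place. All names and the toy sequences are hypothetical.

```python
from itertools import combinations


def identity_similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; NOT the BLOSUM62-based score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_similarity_graph(sequences, similarity=identity_similarity, threshold=0.25):
    """Adjacency sets: an edge joins two sequences whose similarity exceeds the threshold."""
    nodes = sorted(set(sequences))
    edges = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if similarity(a, b) > threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges


# Toy six-residue helices assigned to the same 3bp target.
graph = build_similarity_graph(["RSDHTR", "RADHSR", "QSSNLV"])
```

The resulting adjacency structure is the kind of input a network clustering tool would take; the clustering step itself is not shown.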
We used the network clustering program SPICi (52) with a minimum cluster size of six. Finally, for each cluster, we visualized the sequences within it via a sequence logo (53).

Nearest neighbor decomposition to predict C2H2-ZF binding specificity

In order to extend the predictive scope of our data to arbitrary C2H2-ZF domains that may not share a core sequence with any domain recovered in our screens, we adapted the classic nearest neighbors approach. In a typical implementation of nearest neighbors prediction, given a core sequence C which is not contained in our data set, we would look for all other core sequences in our data set that are most sequence-similar to C and predict a specificity by computing an average across the DNA-binding specificities of all such neighbors. Our approach improves on this classic paradigm by leveraging (i) prior structural knowledge about which residue positions of the core sequence are known to be most important for determining the base at a given position (54) and (ii) information about which amino acids frequently substitute for one another. Specifically, when predicting the base specificity at bi (i.e. preferences at base position i) for a core sequence C, we first hierarchically ranked neighboring core sequences of C found in our data set with respect to bi, and next took a weighted average across the specificities inferred via the lookup method over the top 25 such neighbors. Neighbors considered for use in predicting the specificity at bi include all core sequences in our data set that are exactly Hamming distance 1 from C and share the same residue as C in the amino acid position that is known to interact with bi in the canonical structural binding model. For example, when predicting the base at position b1, we do not consider neighbors that vary from C in position a6. This leaves us with (potentially) 57 Hamming distance 1 neighbors to be hierarchically ranked.
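The count of 57 follows from three free core positions times 19 alternative residues each, which the sketch below enumerates. The b1/a6 pairing is stated in the text; the b2/a3 and b3/a-1 pairings are taken from the canonical binding model. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE_POSITIONS = ("a-1", "a2", "a3", "a6")        # order of residues in a core string
CONTACT = {"b1": "a6", "b2": "a3", "b3": "a-1"}   # position held fixed for each base


def hamming1_neighbors(core: str, base_pos: str) -> list[str]:
    """All core sequences at Hamming distance 1 that keep the contacting residue."""
    fixed = CORE_POSITIONS.index(CONTACT[base_pos])
    neighbors = []
    for i, residue in enumerate(core):
        if i == fixed:
            continue  # neighbors must share the residue contacting this base
        for aa in AMINO_ACIDS:
            if aa != residue:
                neighbors.append(core[:i] + aa + core[i + 1:])
    return neighbors


def observed_neighbors(core, base_pos, known_cores):
    """Restrict the candidates to core sequences present in a data set."""
    return [n for n in hamming1_neighbors(core, base_pos) if n in known_cores]


candidates = hamming1_neighbors("RDHR", "b1")
print(len(candidates))  # 57
```

In practice only the candidates actually recovered in the selections would be available, which is why the text calls the 57 neighbors potential.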
The first level of the ranking hierarchy corresponds to the order in which we allow positions of neighboring core sequences to vary with respect to C. This ordering is chosen based upon previous structural analysis (see Table 3 in (54)). That is, we always allow the core sequence position with the least amount of structural evidence for interacting with bi to vary first, the second least second and so on. For example, when predicting b1, we first look at core sequences that vary from C at position a-1, followed by core sequences that vary at the position a2, and finally core sequences that vary at position a3. For b2, the order of variation is a2, a-1, a6, and for b3 the order of variation is a6, a3, a2. Multiple neighbors that vary in the same position with respect to C are sub-ranked via scores derived from a PAM30 matrix. Specifically, for a neighbor N of C, the substitution score S = PAM30(N,C)/PAM30(C,C) is computed, where PAM30(N,C) is simply the sum of values from the PAM30 matrix for substituting sequence N for sequence C. Numerators of all such scores for a set of neighbors are shifted positively so that the worst possible substitution corresponds to a score of 0, and the exact match to C, if present, receives the highest possible ranking.

Once neighbors for bi have been ranked according to the above algorithm, for each of the top 25 neighbors for bi, the specificities inferred from the lookup procedure are computed and a weighted average is taken across the 25 neighbors’ predicted specificities for position bi, with weights corresponding to the aforementioned PAM30 substitution scores. These steps are repeated separately for base positions b1, b2 and b3 to obtain the complete predicted 3bp DNA-binding specificity of core sequence C. A web-form for predicting a C2H2-ZF domain’s binding specificity using the nearest neighbor decomposition approach is available at http://zf.princeton.edu/b1h/.
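The ranking and weighted-average steps can be sketched as follows. The position orderings come from the text. The weight function is an assumption: `toy_weight` is a made-up stand-in for the shifted, normalized PAM30 score and does not use PAM30 values. Neighbors are assumed to differ from the core at exactly one permitted position. All names and numbers below are hypothetical.

```python
CORE_POSITIONS = ("a-1", "a2", "a3", "a6")
# Order in which core positions are allowed to vary, per base position.
VARY_ORDER = {
    "b1": ("a-1", "a2", "a3"),
    "b2": ("a2", "a-1", "a6"),
    "b3": ("a6", "a3", "a2"),
}
# Toy residue groups standing in for a substitution matrix (not PAM30).
TOY_GROUPS = ("KRH", "DE", "STNQ", "AVLIM", "FYW", "CGP")


def toy_weight(neighbor, core):
    """Hypothetical non-negative weight for a Hamming-1 neighbor."""
    a, b = next((x, y) for x, y in zip(neighbor, core) if x != y)
    same_group = any(a in g and b in g for g in TOY_GROUPS)
    return 1.0 if same_group else 0.5


def rank_neighbors(core, neighbors, base_pos, weight=toy_weight):
    """Sort by which position varies, then by descending substitution weight."""
    def key(n):
        diff = next(i for i, (x, y) in enumerate(zip(core, n)) if x != y)
        level = VARY_ORDER[base_pos].index(CORE_POSITIONS[diff])
        return (level, -weight(n, core))
    return sorted(neighbors, key=key)


def predict_column(core, neighbors, base_pos, columns, weight=toy_weight, k=25):
    """Weighted average of the top-k neighbors' lookup columns for one base position."""
    top = rank_neighbors(core, neighbors, base_pos, weight)[:k]
    if not top:
        raise ValueError("no neighbors available")
    weights = [weight(n, core) for n in top]
    total = sum(weights)
    if total == 0:  # assumption: fall back to an unweighted average
        weights, total = [1.0] * len(top), float(len(top))
    return {b: sum(w * columns[n][b] for w, n in zip(weights, top)) / total
            for b in "ACGT"}


# Toy lookup columns at b1 for three neighbors of the core 'RDHR'.
columns_b1 = {
    "KDHR": {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    "REHR": {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
    "RDNR": {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
}
ranked = rank_neighbors("RDHR", list(columns_b1), "b1")
column = predict_column("RDHR", list(columns_b1), "b1", columns_b1)
```

Passing the weight function as a parameter keeps the ranking logic separate from the choice of substitution matrix, so a PAM30-derived score could be swapped in without changing the rest.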
A database of naturally occurring C2H2-ZF DNA-binding specificities

We previously gathered C2H2-ZF protein DNA-binding specificities obtained from four resources including the JASPAR database (55), the UniProbe database (56), a database of human transcription factors (57) and the FlyFactorSurvey database (58), as described in (40). This database of experimentally determined transcription factor specificities was updated with ChIP-seq data collected by the ENCODE project (59). After merging redundant protein sequences, the combined data set contains 158 proteins. We used this data set for comparing the performance of our nearest neighbor decomposition method (NN) to the performances of other state-of-the-art C2H2-ZF DNA-binding specificity prediction methods. However, substantial overlap was observed between proteins listed in this test data set and the data used to train previously published prediction methods (including those based on random forests (RFs) and support vector machines (SVMs) (39,40)). Overall, 104 out of the 158 proteins in our test data set contain at least one instance of a C2H2-ZF domain used in the training of at least one of the SVM, RF or NN methods. Thus, we compared performance of the three methods on the remaining 54 proteins of the test data set. Additional details of the processing of this test set, as well as prediction of DNA-binding specificities on the test set, are provided in Supplemental Methods 2d.

Evaluating the quality of C2H2-ZF DNA-binding specificity predictions

After predicting the DNA-binding specificity of a given C2H2-ZF (or an array of C2H2-ZFs), we evaluated agreement between the predicted and experimentally determined specificities. In the case where the correct alignment of the experimental and predicted PWMs was known a priori (i.e. when predicting specificity for a single domain and comparing it to a 3bp subsite selection), we compared pairs of columns (base positions) of the aligned position weight matrices (PWMs).
We considered a base position to have been correctly predicted if the Pearson correlation coefficient (PCC) between the predicted and experimental columns of nucleotide frequencies for that base position is at least 0.5. When the correct alignment of the experimental and predicted PWMs was not known a priori (i.e. when predicting specificity for naturally occurring C2H2-ZFs), we used a previously published alignment technique, where alignment scores are based on an information content corrected version of the PCC (40). A more detailed description of our evaluation pipeline is provided in Supplemental Methods 2e.
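For the pre-aligned case, the column-wise criterion can be sketched as below. Each PWM is represented as a list of columns holding frequencies in the order A, C, G, T. Treating a column with no variance as uncorrelated is an assumption made here. The information-content-corrected alignment score from (40) is not reproduced. Function names and the toy matrices are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a flat column counts as uncorrelated
    return cov / (sd_x * sd_y)


def correct_columns(predicted, experimental, threshold=0.5):
    """True for each aligned base position whose PCC is at least the threshold."""
    return [pearson(p, e) >= threshold for p, e in zip(predicted, experimental)]


# Toy 3bp PWMs; each column lists frequencies for A, C, G, T.
predicted = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.1, 0.1, 0.7, 0.1]]
experimental = [[0.9, 0.05, 0.05, 0.0], [0.1, 0.7, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]]
calls = correct_columns(predicted, experimental)
print(calls)  # [True, False, False]
```

Only the first position passes: the second predicted column is flat, and the third peaks at a different base than the experimental column.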


Publication date: 2015